I've been using AWS Rekognition to detect text in images. AWS Rekognition charges every time I ask it to determine the text contained in an image, even if the image doesn't contain any text. If there were a way to determine whether an image contains text before calling AWS Rekognition, I could reduce my costs.
When I started, I didn't have any existing data about how many of the images contained recognizable text, so I decided to call Rekognition on every image for the first month to gather some. In that month I collected results on 696,790 unique images, where unique means that both the SHA256 hashes and the ResNet50 feature vectors differed.
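The SHA256 half of that deduplication can be sketched with the standard library alone (a minimal example; `unique_by_sha256` and its input list are my own names, and the ResNet50 feature-vector comparison is not shown):

```python
import hashlib

def unique_by_sha256(image_bytes_list):
    """Return only the byte strings whose SHA256 digest has not been seen before."""
    seen = set()
    unique = []
    for data in image_bytes_list:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(data)
    return unique

# Two identical payloads collapse to one entry.
print(len(unique_by_sha256([b"img-a", b"img-b", b"img-a"])))  # 2
```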
AWS Rekognition provides a number of APIs, such as object detection and face detection, but the function I'm using is DetectText. This API method recognizes text inside images larger than 80x80 pixels. DetectText returns a list of detected words and lines of words, along with a confidence score for each detection. This will be useful, as I only want to consider highly confident and therefore accurate detections.
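As a sketch of how the per-image statistics used below could be derived from a DetectText response — the response contains `TextDetections` entries with `DetectedText`, `Type` (WORD or LINE) and `Confidence` fields, per the Rekognition API; the summarizing function is my own:

```python
def summarize_detect_text(response):
    """Compute word count, character count and highest confidence
    from a Rekognition DetectText response dictionary."""
    words = [d for d in response.get('TextDetections', [])
             if d['Type'] == 'WORD']
    return {
        'totalWords': len(words),
        'totalCharacters': sum(len(w['DetectedText']) for w in words),
        'highestConfidence': max((w['Confidence'] for w in words), default=0.0),
    }

sample = {'TextDetections': [
    {'DetectedText': 'HELLO', 'Type': 'WORD', 'Confidence': 99.2},
    {'DetectedText': 'WORLD', 'Type': 'WORD', 'Confidence': 98.7},
    {'DetectedText': 'HELLO WORLD', 'Type': 'LINE', 'Confidence': 99.0},
]}
print(summarize_detect_text(sample))
# {'totalWords': 2, 'totalCharacters': 10, 'highestConfidence': 99.2}
```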
There are a number of additional statistics that can be calculated on the results of AWS Rekognition. The best way to view them is with histograms from pandas, but first let's get some basic statistics.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import math
import pandas as pd
import os
import glob
import random
import matplotlib.image as mpimg
from matplotlib import colors
# All of the stats are stored in a file extracted from other data.
detection_results = pd.read_csv('image-stats.csv', index_col=False)
detection_results.describe()
There are outliers in width and height that should be dealt with as they will skew most of the histograms.
Let's see the histograms after filtering to images with a width < 1000 and a height < 1000.
detection_results[(detection_results.width < 1000) &
                  (detection_results.height < 1000)].hist(figsize=(20, 20), bins=25)
A few interesting observations can be made about this data.
Bounding boxes can actually be larger than the source image: when Rekognition detects text, it may extend the bounding box beyond the image dimensions. I believe this may help with truncated letters.
There is supposedly a 50-word limit for DetectText, but that limit appears to have been exceeded.
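The out-of-range boxes become visible when converting Rekognition's relative `BoundingBox` geometry (`Left`, `Top`, `Width`, `Height` as fractions of the image) into pixel coordinates. A small sketch, assuming that response shape (the helper name is my own):

```python
def bbox_to_pixels(bbox, img_width, img_height):
    """Convert a relative Rekognition BoundingBox to pixel coordinates
    (left, top, right, bottom). Coordinates may fall outside the image
    when Rekognition extends a box past the image edge."""
    left = bbox['Left'] * img_width
    top = bbox['Top'] * img_height
    return (left, top,
            left + bbox['Width'] * img_width,
            top + bbox['Height'] * img_height)

# A box whose right edge lands beyond a 200px-wide image.
box = {'Left': 0.75, 'Top': 0.25, 'Width': 0.5, 'Height': 0.25}
print(bbox_to_pixels(box, 200, 100))  # (150.0, 25.0, 250.0, 50.0)
```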
# Break down the images into two classes.
# I'm interested in images that contain more than one word, more than five
# characters in total, and a highest confidence of at least 99%.
#
# You may have different desired thresholds.
#
has_text = detection_results[(
    (detection_results.totalWords > 1) &
    (detection_results.totalCharacters > 5) &
    (detection_results.highestConfidence >= 99))]
missing_text = detection_results[~(
    (detection_results.totalWords > 1) &
    (detection_results.totalCharacters > 5) &
    (detection_results.highestConfidence >= 99))]
# Show the statistics for images that contain text.
has_text.describe()
has_text.hist(figsize=(20,20), bins=50)
missing_text.describe()
# Load and display some images with associated data.
def show_images(rows):
    images = []
    for index, img_path_record in rows.iterrows():
        images.append([mpimg.imread(img_path_record['img_filename']), img_path_record])
    plt.figure(figsize=(20, 40))
    plt.axis('off')
    columns = 4
    for i, record in enumerate(images):
        image = record[0]
        info = record[1]
        # Integer division so subplot receives an int row count.
        plt.subplot(len(images) // columns + 1, columns, i + 1)
        plt.axis('off')
        plt.title("Confidence: {}, Words: {}, Chars: {}".format(
            round(info['highestConfidence'], 2), info['totalWords'], info['totalCharacters']))
        plt.imshow(image)

show_images(has_text.sample(frac=1)[0:50])
show_images(missing_text.sample(frac=1)[0:50])
The images have now been split into two classes: one that should be sent to Rekognition and one that shouldn't. The classes are roughly equal in size, so there shouldn't be much of a class imbalance problem.
369,610 images pass the classifier criteria and 327,180 do not: ~53% pass and ~47% don't.
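Those percentages follow directly from the counts (a quick arithmetic check):

```python
has_text_count = 369610
missing_text_count = 327180
total = has_text_count + missing_text_count  # 696790, matching the dataset size

print(round(has_text_count / total * 100, 2))      # 53.04
print(round(missing_text_count / total * 100, 2))  # 46.96
```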
To create the classifier I tried a few machine learning techniques from the laziest to the least lazy:
Transfer learning. Use the lower layers of an existing convolutional neural network (CNN) (e.g. ResNet50) and add additional layers on top. This yielded an accuracy of 86% on a smaller extracted test dataset.
Transfer learning with a gradient boosted tree. Convert the images to feature vectors using an existing CNN (e.g. ResNet50), then use xgboost to train a gradient boosted tree. This is a bit like transfer learning and trees combined. It yielded an accuracy of 88% on a smaller extracted test dataset.
Train a convolutional neural network specifically for this task. This approach was the most successful and will be explained in the rest of this document.
For creating the classifier I will be using Tensorflow 2.0. Let's start by importing everything that will be needed.
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import pickle
import pathlib
import math
import datetime
import os
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.metrics import precision_recall_curve
%matplotlib inline
sns.set(font_scale=2)
AUTOTUNE = tf.data.experimental.AUTOTUNE
tf.__version__
All of the images are located in two directories, ./image-classifier-dataset/true and ./image-classifier-dataset/false respectively based on if the image passed the classification criteria.
I've chosen to split the data into training, test and validation sets.
The test set will be 10% of the data with validation being 1%.
This is a pretty liberal allocation, but having trained the model multiple times I know I'd rather put most of my data in the training set than the test set.
# Additional imports beyond those in the cell above.
from tensorflow.keras import datasets, layers, models
from sklearn.model_selection import train_test_split
# The images have been saved to a directory structure that looks like
# image-classifier-dataset/
#     true/  - Images that should pass the classifier
#     false/ - Images that should not pass the classifier
root = pathlib.Path('./image-classifier-dataset/')
all_image_paths = [str(path) for path in list(root.glob('*/*'))]
all_image_paths.sort()

# Create the integer labels for each image by looking at the parent directory
# name of the image.
#
# 0 will represent that the image should not pass the classifier
# 1 will represent that the image should pass the classifier
all_image_labels = [0 if pathlib.Path(path).parent.name == 'false' else 1
                    for path in all_image_paths]
# Determine the size of the test set
test_size = math.floor(len(all_image_paths)*0.10)
# Determine the size of the validation set.
validation_size = math.floor(len(all_image_paths)*0.01)
# For reproducible results.
RANDOM_SEED = 2019
training_all_image_paths, test_all_image_paths, training_all_image_labels, test_all_image_labels = train_test_split(
    all_image_paths, all_image_labels, test_size=test_size, random_state=RANDOM_SEED)
training_all_image_paths, validation_all_image_paths, training_all_image_labels, validation_all_image_labels = train_test_split(
    training_all_image_paths, training_all_image_labels, test_size=validation_size, random_state=RANDOM_SEED)
print("Training size: {} Test size: {} Validation size: {}".format(
    len(training_all_image_paths),
    len(test_all_image_paths),
    len(validation_all_image_paths)))
# Save the contents of the test, training and validation sets.
pickle.dump(training_all_image_paths, open("training_image_paths.p", "wb"))
pickle.dump(test_all_image_paths, open("test_image_paths.p", "wb"))
pickle.dump(validation_all_image_paths, open("validation_image_paths.p", "wb"))
The images themselves have already been scaled to be the size of 224 by 224 pixels. This dimension was convenient since it was the input dimension to the ResNet50 pretrained CNN which was used for the initial experiments with transfer learning.
Once an image has been loaded, its pixel values are mapped from the 0-255 range to zero-centered floats ranging from -1 to 1. This is a very standard thing to do with Tensorflow when using images.
The Tensorflow documentation recommends that image pre-processing be performed once and cached to speed up the image pipeline. I've found that the disk space needed to serialize all of the preprocessed images isn't reasonable.
The hardware that I will be using to train this neural network offers multiple CPU cores, and in my experience so far Tensorflow can adequately keep the GPU saturated with data while performing the image pipeline processing.
# Load the image, resize if necessary, then zero center and normalize
# between -1 and 1. This is pretty standard for neural networks.
def load_and_preprocess_image(path):
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224])
    image /= 127.5
    image -= 1.
    return image
To perform training we need to create a tf.data.Dataset that combines each image with its expected label (the classification value: 1 for true, 0 for false).
Additionally, for training and validation the datasets should form batches of examples. A batch is a fixed-size collection of images and labels used to adjust the weights of the neural network. For now I've chosen a batch size of 128 images, meaning the gradient for learning will be averaged over the results of 128 images before the weights are adjusted. Choosing a smaller batch size may allow more accuracy but increases the necessary training time.
# Determine the batch size for training: how many images will be considered
# before adjusting the weights in the direction of the gradient.
BATCH_SIZE = 128

test_path_ds = tf.data.Dataset.from_tensor_slices(test_all_image_paths)
test_image_ds = test_path_ds.map(
    load_and_preprocess_image, num_parallel_calls=AUTOTUNE)
test_label_ds = tf.data.Dataset.from_tensor_slices(
    tf.cast(test_all_image_labels, tf.int8))

training_path_ds = tf.data.Dataset.from_tensor_slices(training_all_image_paths)
training_image_ds = training_path_ds.map(
    load_and_preprocess_image, num_parallel_calls=AUTOTUNE)
training_label_ds = tf.data.Dataset.from_tensor_slices(
    tf.cast(training_all_image_labels, tf.int8))

validation_path_ds = tf.data.Dataset.from_tensor_slices(
    validation_all_image_paths)
validation_image_ds = validation_path_ds.map(
    load_and_preprocess_image, num_parallel_calls=AUTOTUNE)
validation_label_ds = tf.data.Dataset.from_tensor_slices(
    tf.cast(validation_all_image_labels, tf.int8))

validation_image_label_ds = tf.data.Dataset.zip(
    (validation_image_ds, validation_label_ds))
test_image_label_ds = tf.data.Dataset.zip((test_image_ds, test_label_ds))

# Training neural nets benefits from randomly ordered batches, but shuffling
# all of the data would be very expensive in terms of time, so shuffle within
# a buffer of 128 times the batch size.
training_image_label_ds = tf.data.Dataset.zip((training_image_ds, training_label_ds)).shuffle(
    buffer_size=BATCH_SIZE*128, reshuffle_each_iteration=True).repeat().batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
validation_batches = validation_image_label_ds.batch(BATCH_SIZE)
test_batches = test_image_label_ds.batch(
    BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
Deciding on the neural network architecture is a problem without a clear path from start to finish. Choices need to be made about the number of layers, the types of layers and their parameters, the activation functions, and the choice of optimizer, which determines the learning rate schedule.
If the neural network has too many parameters it will overfit the training set; if it has too few it will not be able to learn the general patterns effectively.
from tensorflow.keras import datasets, layers, models

model = models.Sequential([
    layers.Conv2D(24, 3, padding='same', activation=layers.ELU(
        alpha=1.0), input_shape=(224, 224, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation=layers.ELU(alpha=1.0)),
    layers.BatchNormalization(),
    layers.Conv2D(64, 5, padding='same', activation=layers.ELU(alpha=1.0)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 7, padding='same', activation=layers.ELU(alpha=1.0)),
    layers.MaxPooling2D(),
    layers.BatchNormalization(),
    layers.Conv2D(64, 5, padding='same', activation=layers.ELU(alpha=1.0)),
    layers.MaxPooling2D(),
    layers.Conv2D(8, 3, padding='same', activation=layers.ELU(alpha=1.0)),
    layers.MaxPooling2D(),
    layers.BatchNormalization(),
    layers.Flatten(),
    layers.Dense(64, activation=layers.ELU(alpha=1.0)),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer="rmsprop",
              loss='binary_crossentropy',
              metrics=["accuracy"])
model.summary()
The model is now defined, so it is time to train the network on the training set of images.
I've chosen to only train for 7 epochs since in previous testing that was sufficient for the level of accuracy I needed.
steps_per_epoch = math.ceil(len(training_all_image_paths) / BATCH_SIZE)

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir, histogram_freq=1)

checkpoint_path = "train.weights.{epoch:02d}-{val_loss:.2f}.hdf5"
checkpoint_dir = os.path.dirname(checkpoint_path)
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

model.fit(training_image_label_ds,
          epochs=7,
          steps_per_epoch=steps_per_epoch,
          validation_data=validation_batches,
          callbacks=[cp_callback, tensorboard_callback],
          )
model.save('train.h5')
The model has now been trained on all of the data for 7 epochs, so it is time to evaluate its performance against the set of test images and labels that the neural net was not trained on.
# y_true = test_all_image_labels
# y_scores = model.predict(test_batches)
import pickle
from sklearn.metrics import precision_recall_curve

y_true = pickle.load(open("actual-truth.p", "rb"))
y_scores = pickle.load(open("predictions.p", "rb"))
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    """
    Modified from:
    Hands-On Machine Learning with Scikit-Learn
    and TensorFlow; p.89
    """
    plt.figure(figsize=(5, 5))
    plt.title("Precision and Recall Scores as a function of the decision threshold")
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.ylabel("Score")
    plt.ylim([0.90, 1.0])
    plt.xlim([0.00, 0.85])
    plt.xlabel("Decision Threshold")
    plt.legend(loc='best')
plot_precision_recall_vs_threshold(precision, recall, thresholds)
The output of the neural network is a value between 0 and 1 that is interpreted as the probability that the image passes the classification criteria. The probability is continuous, but all of the test data is labeled 0 or 1, so a threshold value must be chosen to convert the probability into a 0 or 1. For example, with a threshold of 0.5, a probability greater than or equal to 0.5 is considered a 1 and anything below it a 0.
The choice of the threshold should be driven by the desired performance of the neural network with regard to precision and recall. If the threshold is a high number such as 0.90, the neural network must produce a high probability for an image to pass, which may cause images that match the criteria to be classified as not matching. This is known as a false negative.
Conversely, if the threshold is too low, images will pass the classifier when they should not, which means more calls to AWS Rekognition will be made than necessary. But it also means there will be fewer false negatives.
The choice of the threshold is a tradeoff between these opposing situations.
I'd like to minimize false negatives while not sacrificing overall accuracy; specifically, I'd like a recall value >= 0.95. As such I have chosen a threshold value of 0.4325.
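A threshold meeting a recall target can also be picked programmatically from the `precision_recall_curve` output, e.g. the largest threshold whose recall still meets the target. A sketch (the helper name is my own; plain lists stand in for the sklearn arrays, which follow the convention that `recalls` has one more element than `thresholds` and is non-increasing):

```python
def pick_threshold(recalls, thresholds, target_recall=0.95):
    """Return the largest threshold whose recall is still >= target_recall.
    Assumes thresholds ascend and recall falls as the threshold grows."""
    best = thresholds[0]
    for r, t in zip(recalls, thresholds):
        if r >= target_recall:
            best = t
    return best

# Toy values: recall drops as the threshold rises.
recalls = [1.0, 0.98, 0.96, 0.93, 0.90]
thresholds = [0.1, 0.3, 0.45, 0.6]
print(pick_threshold(recalls, thresholds))  # 0.45
```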
from sklearn import metrics
import seaborn as sns
THRESHOLD = 0.4325
LABELS = ['False', 'True']
max_test = y_true
max_predictions = [1 if pred >= THRESHOLD else 0 for pred in y_scores]
confusion_matrix = metrics.confusion_matrix(max_test, max_predictions)
plt.figure(figsize=(5, 5))
sns.heatmap(confusion_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d", annot_kws={"size": 20});
plt.title("Confusion matrix", fontsize=20)
plt.ylabel('Actual label', fontsize=20)
plt.xlabel('Predicted label', fontsize=20)
plt.show()
values = confusion_matrix.view()
error_count = values.sum() - np.trace(values)
precision = values[0][0]/(values[0][0]+values[0][1])
recall = values[0][0]/(values[0][0]+values[1][0])
print("Precision:", precision)
print("Recall:", recall)
print("Accuracy:", 1 - (error_count / len(y_true)))
print("Error Rate:", error_count / len(y_true))
This means that the model has only about a ~5% false negative rate: it will fail to send images to Rekognition about 5% of the time when it should. On the other hand, it will be correct about 91.6% of the time when it does send images to Rekognition.
How would this have performed on the first dataset?
total_images = (len(missing_text) + len(has_text))
true_text_percentage = len(has_text) / (len(missing_text) + len(has_text))
false_text_percentage = len(missing_text) / (len(missing_text) + len(has_text))
# The base percentage of the population, plus the percentage of false positives, subtracting the rate of false negatives.
total_call_percentage = true_text_percentage + (1-precision) - (1 - recall)
predicted_calls = math.ceil(total_call_percentage * total_images)
print("Predicted calls to Rekognition: {}, predicted Rekognition calls saved: {}".format(predicted_calls, total_images-predicted_calls))
print("Savings rate: {}% of unnecessary calls".format(round((total_images-predicted_calls)/total_images*100, 2)))
print("Actual false rate: {}%".format(round(false_text_percentage*100, 2)))
The model results in calls to Rekognition for about 43.6% of images, which is better than having no model and making all calls. The test set has a rate of 46.96% of images without text, so this is an acceptable level of performance to me.
Now that the model has been trained and the threshold has been selected, the model should be saved so that it can be used by Tensorflow serving to process requests.
The code below will load the model from train.h5 and then save it in a directory called text-detect as version 1 of the model. If there were multiple versions of the same model you could increment the version number.
import keras.backend.tensorflow_backend as K
K.set_session

model = tf.keras.models.load_model(
    'train.h5', custom_objects={'ELU': tf.keras.layers.ELU})
tf.keras.experimental.export_saved_model(
    model, './text-detect/1', custom_objects={'ELU': tf.keras.layers.ELU})
Tensorflow serving requires some configuration to work with multiple models. I've added the text detection model so that my model.conf looks like this.
model_config_list: {
  config: {
    name: "resnet50",
    base_path: "/models/resnet50",
    model_platform: "tensorflow"
  },
  config: {
    name: "xception",
    base_path: "/models/xception",
    model_platform: "tensorflow"
  },
  config: {
    name: "text-detect",
    base_path: "/models/text-detect",
    model_platform: "tensorflow"
  }
}
Next, start the Tensorflow serving Docker container like this.
#!/bin/bash
export MODELDIR=[FILL IN WITH YOUR MODEL DIR]
docker run --rm \
    -p 8501:8501 \
    -v "$MODELDIR/resnet-classifier:/models/resnet50" \
    -v "$MODELDIR/xception-classifier:/models/xception" \
    -v "$MODELDIR/text-detect:/models/text-detect" \
    -v "$MODELDIR/model.config:/model.config" \
    tensorflow/serving --enable_batching --model_config_file=/model.config
The model now works, but the input is a 224 by 224 pixel matrix with a separate 32-bit float input for each pixel's red, green and blue value. This is quite unwieldy, especially since I'm using the HTTP interface to the Tensorflow model server.
To increase efficiency I'd like to convert the model into an estimator that can take a Base64 encoded JPEG image as input. This means less data is sent to the model server, since JPEG is a compressed format. To do this I use the following code:
import os
import tensorflow as tf
import keras.backend.tensorflow_backend as K
K.set_session

model = tf.keras.models.load_model(
    'text-detect-golden.h5', custom_objects={'ELU': tf.keras.layers.ELU})

WIDTH = 224
HEIGHT = 224
CHANNELS = 3

def image_preprocessing(image):
    image = tf.expand_dims(image, 0)
    image = tf.image.resize(
        image, [WIDTH, HEIGHT], method=tf.image.ResizeMethod.BILINEAR)
    image = tf.squeeze(image, axis=[0])
    image = tf.cast(image, dtype=tf.float32)
    image = (image - 127.5) / 127.5
    return image

def serving_input_receiver_fn():
    def prepare_image(image_str_tensor):
        image = tf.image.decode_jpeg(image_str_tensor, channels=CHANNELS)
        return image_preprocessing(image)

    # TensorFlow will have already converted the list of strings into a numeric
    # tensor; make sure the conversion from JPEG to float runs on the CPU and
    # not the GPU.
    with tf.device('/cpu:0'):
        input_ph = tf.compat.v1.placeholder(tf.string, shape=[None])
        images_tensor = tf.map_fn(
            prepare_image, input_ph, back_prop=False, dtype=tf.float32)
    return tf.estimator.export.ServingInputReceiver(
        {'conv2d_input': images_tensor},
        {'image': input_ph})

estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model,
    custom_objects={'ELU': tf.keras.layers.ELU}
)
estimator.export_saved_model(
    "./text-detect-estimator/1/",
    serving_input_receiver_fn=serving_input_receiver_fn,
)
By replacing the original model with the estimator and serving it using the model server I can now test the results by passing a Base64 encoded JPEG image in JSON via curl.
{
  "signature_name": "serving_default",
  "instances": [
    { "b64": "/9j/4AAQSkZJRgABAQAAAQABAAD/..." }
  ]
}
Perform the prediction using:
curl -X POST -d @test.json http://localhost:8501/v1/models/text-detect:predict
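The same request can be made from Python with only the standard library, building the JSON body with the Base64-encoded JPEG bytes. A sketch assuming the URL from the curl command above (the `predict` call needs a running model server; only the payload construction is exercised here):

```python
import base64
import json
from urllib import request

def build_predict_request(jpeg_bytes):
    """Build the JSON body TensorFlow Serving expects: a list of
    instances, each a {"b64": ...} object holding the encoded JPEG."""
    return json.dumps({
        "signature_name": "serving_default",
        "instances": [
            {"b64": base64.b64encode(jpeg_bytes).decode("ascii")}
        ],
    })

def predict(jpeg_bytes, url="http://localhost:8501/v1/models/text-detect:predict"):
    """POST the payload to the model server and return its predictions."""
    body = build_predict_request(jpeg_bytes).encode("utf-8")
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"]
```

Calling `predict(open('image.jpg', 'rb').read())` then returns the probability produced by the sigmoid output layer.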
Making the change to a Keras estimator that decodes the JPEG improved inference performance. Originally the fastest inference took 150ms for a single image; using the estimator, inference takes 40ms on a local CPU. This is with aggressive batching by the model server.
If I were to use a GPU the inference time could be as low as 4ms with aggressive batching of images.